Abstract. We present algorithms for length-constrained maximum sum
segment and maximum density segment problems, in particular, and the
problem of finding length-constrained heaviest segments, in general, for
a sequence of real numbers. Given a sequence of n real numbers and two
real parameters L and U (L ≤ U), the maximum sum segment problem
is to find a consecutive subsequence, called a segment, of length at least
L and at most U such that the sum of the numbers in the subsequence is
maximum. The maximum density segment problem is to find a segment
of length at least L and at most U such that the density of the numbers
in the subsequence is the maximum. For the first problem with
non-uniform width there is an algorithm with time and space complexities
in O(n). We present an algorithm with time complexity in O(n) and
space complexity in O(U). For the second problem with non-uniform
width there is a combinatorial solution with time complexity in O(n)
and space complexity in O(U). We present a simple geometric algorithm
with the same time and space complexities.

We extend our algorithms to respectively solve the length-constrained k
maximum sum segments problem in O(n + k) time and O(max{U, k})
space, and the length-constrained k maximum density segments problem
in O(n min{k, U − L}) time and O(U + k) space. We present extensions
of our algorithms to find all the length-constrained segments having user
specified sum and density in O(n + m) and O(n log(U − L) + m) time
respectively, where m is the number of segments output. Previously, there was no
known algorithm with non-trivial result for these problems. We indicate
the extensions of our algorithms to higher dimensions. All the algorithms
can be extended in a straightforward way to solve the problems with
non-uniform width and non-uniform weight.

The algorithms have potential applications in different areas of biomolec- 
ular sequence analysis including finding CG-rich regions, TA and CG- 
deficient regions, CpG islands and regions rich in periodical three-base 
patterns, post processing sequence alignment, annotating multiple se- 
quence alignments, and computing length constrained ungapped local 
alignment. They also have applications in other areas such as pattern 
recognition, digital image processing and data mining. 
1 Introduction

In this paper we study two problems concerning the determination of length- 
constrained heaviest segments in a sequence of real numbers. The problems in 
their basic form are described below: 

Definition 1. Given a sequence of pairs of real numbers (a_i, w_i), i = 1, 2, ..., n,
with w_i > 0, and another pair of real numbers L ≤ U. (a) The maximum
sum segment problem is to find a subsequence a_i, a_{i+1}, ..., a_j whose sum
a_i + a_{i+1} + ... + a_j is the maximum under the constraint that
L ≤ w_i + w_{i+1} + ... + w_j ≤ U. (b) The maximum density segment problem is to
find a subsequence a_i, a_{i+1}, ..., a_j whose density
(a_i + a_{i+1} + ... + a_j)/(w_i + w_{i+1} + ... + w_j) is the
maximum under the constraint that L ≤ w_i + w_{i+1} + ... + w_j ≤ U.

The maximum sum segment problem, with uniform width (w_i = 1 for all i)
and no restriction on segment length, was formulated by Ulf Grenander [22,6].
He found it in the area of pattern recognition in digitized images. The original
problem, as proposed by him, was in 2 dimensions. In that setting the maximum
sum subarray was the estimator for the maximum likelihood of a pattern in a
digital image. He also simplified the problem to 1 dimension. The problem also
has applications in other areas such as graphics, data mining [1,17,18] and
bioinformatics [4]. An optimal linear time algorithm for the problem, proposed
by Jay Kadane, is described by Bentley [6] and Gries [23]. Its space complexity
is O(1). The two-dimensional version of the problem is to find a connected
rectangular submatrix of maximum sum from a two-dimensional rectangular
input matrix of real numbers [6]. Here the widths are uniform, i.e., w_i = 1 for
all i, and there is no restriction on the size of the submatrix. The problem has
been extended to higher dimensions [38]. In higher dimensions the problem is
called the maximum sum subarray problem. The higher dimensional problem has
applications in the area of data mining (dimensions being less than 4) and
Monte-Carlo simulation (dimensions being high) [38]. It can be solved by reducing it to
1-dimensional problems [7, 38]. For a 2-dimensional m x n matrix there are O(m^2)
column intervals. Each of them is solved using Kadane's linear time algorithm for
the maximum sum segment problem. Hence, its time complexity is O(m^2 n) [7,38].
For this case, i.e., 2 dimensions with uniform width and no restriction on length,
there is a better algorithm based on the distance matrix multiplication technique [38,
37]. Its running time is subcubic.
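
The reduction to 1-dimensional problems can be sketched as follows. This is a minimal illustration, assuming uniform widths and no length constraint; the function names `kadane` and `max_sum_submatrix` are ours and do not come from the cited papers, and the intervals are taken over the m rows so that each 1-dimensional instance has length n.

```python
from typing import List


def kadane(a: List[float]) -> float:
    """Maximum sum of a non-empty consecutive segment: O(n) time, O(1) space."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)      # extend the current segment or restart at x
        best = max(best, cur)
    return best


def max_sum_submatrix(mat: List[List[float]]) -> float:
    """Fix a pair of rows (top, bottom), collapse them into column sums,
    and solve the resulting 1-dimensional instance with Kadane's algorithm.
    There are O(m^2) row pairs and each costs O(n), so O(m^2 n) overall."""
    m, n = len(mat), len(mat[0])
    best = mat[0][0]
    for top in range(m):
        col = [0.0] * n            # column sums of rows top..bottom
        for bottom in range(top, m):
            for c in range(n):
                col[c] += mat[bottom][c]
            best = max(best, kadane(col))
    return best
```

The column sums are updated incrementally as `bottom` advances, which is what keeps the cost of each row pair at O(n).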

Huang [25] introduced the restriction of length cut off L in the setting of 
biomolecular sequence analysis to avoid reporting extremely short segments. He 
gave a linear time algorithm for computing the maximum sum segment of length 
at least L, but no restriction on the upper bound of its length, i.e., U = n. He had 
observed that the segments reported by the algorithm are usually much larger 
than L. From this observation Lin et al. [28] argued that the segments reported 
by the method may contain some poor and irrelevant segments. To avoid this 
they introduced the restriction of upper bound U on the length of the segment. 
They proposed a linear time algorithm for the problem when there is only the
upper bound U on the length of the segment, but no lower bound, i.e., L = 0.
They combined that algorithm with Huang's [25] technique to develop a linear
time algorithm for arbitrary L and U. Its space requirement is also linear. In this
paper, we present an algorithm for this general problem with time complexity in
O(n) and space complexity in O(U). We indicate the extension of this algorithm
to solve the problem in higher dimensions by using the technique of reducing the
problem to 1 dimension [7, 38].

The k maximum sum segments problem was introduced by Bae and Takaoka [5].
There was no restriction on the segment length. A natural extension of this
problem is the k length-constrained maximum sum segments problem. The problem
is defined as follows:

Definition 2. Given a sequence of real numbers a_i, i = 1, 2, ..., n, a pair of
real numbers L ≤ U and an integer k such that 1 ≤ k ≤ (n − U + 1)(U − L +
1) + (1/2)(U − L)(U − L + 1). The k length-constrained maximum sum segments
problem is to find k subsequences of consecutive elements of length at least L and
at most U such that their sums are the k largest among all the possible segments
of length at least L and at most U.
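
The upper bound on k in Definition 2, (n − U + 1)(U − L + 1) + (1/2)(U − L)(U − L + 1), is the total number of feasible segments. The short check below is only an illustration, assuming integer L and U with 1 ≤ L ≤ U ≤ n; the function names are hypothetical.

```python
def count_feasible(n: int, L: int, U: int) -> int:
    """Count segments of length between L and U by direct enumeration:
    there are n - length + 1 segments of each length."""
    return sum(n - length + 1 for length in range(L, U + 1))


def closed_form(n: int, L: int, U: int) -> int:
    """The bound from Definition 2; (U - L)(U - L + 1) is always even,
    so the integer division is exact."""
    return (n - U + 1) * (U - L + 1) + (U - L) * (U - L + 1) // 2
```

For example, with n = 5, L = 2 and U = 3 there are 4 segments of length 2 and 3 of length 3, and both functions return 7.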

When there is no restriction on segment length, i.e., L = 0 and U = n, Brodal
and Jorgensen [9] gave an optimal linear time, i.e., O(n + k) time, algorithm for
this. Their algorithm constructs, in O(n) time, a partially persistent [13] binary
maximum heap that implicitly contains the sums of all the (n choose 2) + n
possible segments. The heap is a modified version of the self-adjusting heap of Sleator
and Tarjan [34]. The k maximum sums are selected from the heap using the linear
time heap selection algorithm of Frederickson [16]. Brodal and Jorgensen [9]
extended their algorithm to higher dimensions by using the technique of reducing
the problem to 1 dimension [7, 38].

Combining with their technique we extend our algorithm for the maximum 
sum segment problem to solve the k length-constrained (i.e., arbitrary L and U) 
maximum sum segments problem. Its time and space complexities are O(n + k)
and O(U + k) respectively. Previously, there was no known algorithm with
non-trivial result for this case.

For the maximum density segment problem, when the widths are uniform
and there is no restriction on the segment length, the maximum element in the
sequence is the solution, and it can be found in a straightforward way in
n − 1 comparisons and O(1) space. When U = L the problem is trivially solvable
in O(n) time since there are n − U + 1 feasible segments. When the widths are
uniform, U ≠ L and there is no upper bound (U > n − L), Huang [25] showed that
the length of the maximum density segment is at most 2L − 1. So, this case is
equivalent to the case when U = 2L − 1 and can be solved in O(nL) time using a brute
force method since the number of feasible segments is O(nL). For this case Lin
et al. [28] gave an O(n log L) time algorithm by using a method of right skew
decomposition of the sequence. When the widths are uniform, and U and L are
arbitrary, Goldwasser et al. [20] gave an O(n) time algorithm. For the general
case, where the widths are not uniform and U and L are arbitrary, Goldwasser
et al. [21] extended the right skew decomposition method of Lin et al. [28] to
develop an O(n) time and space algorithm. A combinatorial solution with time
complexity in O(n) and space complexity in O(U) was proposed in [11]. The
algorithm works in an online manner. In the same paper it was pointed out that
the linearity claim of a geometric approach by Kim [27] is flawed. In this paper
we modify Kim's algorithm to address the flaw, while retaining the simplicity,
elegance and linearity of his geometric approach. Our algorithm's time and space
complexities are O(n) and O(U) respectively, and it works in an online manner.¹

The k maximum sum segments problem was introduced by Bae and Takaoka [5].
A natural extension of this problem is the k length-constrained maximum density
segments problem. The problem is defined as follows:

Definition 3. Given a sequence of real numbers a_i, i = 1, 2, ..., n, a pair of
real numbers L ≤ U and an integer k such that 1 ≤ k ≤ (n − U + 1)(U − L +
1) + (1/2)(U − L)(U − L + 1). The k length-constrained maximum density segments
problem is to find k subsequences of consecutive elements of length at least L
and at most U such that their densities are the k largest among all the possible
segments of length at least L and at most U.

Combining with the technique of Brodal and Jorgensen [9] we extend our
algorithm for the maximum density segment problem to solve the k
length-constrained (i.e., arbitrary L and U) maximum density segments problem. Its
time and space complexities are O(n min{k, U − L}) and O(U + k) respectively,
as stated in the abstract and in Theorems 5 and 6. Previously,
there was no known algorithm with non-trivial result for this problem.

Huang [25] introduced the problem of finding segments of a sequence
satisfying a sum requirement. The content requirement is expressed as the count of
equal length oligomers in a biomolecular sequence. We shall call this problem
the required sum segments problem. A natural extension of this is the required
density segments problem. The problems are defined as:

Definition 4. Given a sequence of real numbers a_i, i = 1, 2, ..., n, a real
number d, and another pair of real numbers L ≤ U. (a) The required sum segments
problem is to find all the subsequences a_i, a_{i+1}, ..., a_j of length at least L and
at most U such that a_i + a_{i+1} + ... + a_j ≥ d. (b) The required density segments
problem is to find all the subsequences a_i, a_{i+1}, ..., a_j of length at least L and
at most U such that (a_i + a_{i+1} + ... + a_j)/(w_i + w_{i+1} + ... + w_j) ≥ d.
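
To make Definition 4 concrete, the following brute-force baseline lists every qualifying segment. It is a sketch under stated assumptions: uniform widths, integer L and U with L ≥ 1, and the threshold read as "at least d". It runs in O(n(U − L + 1)) time and is not the output-sensitive algorithm of this paper; the name `required_segments` is hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def required_segments(a: List[float], L: int, U: int, d: float,
                      by_density: bool = False) -> List[Tuple[int, int]]:
    """Return all 1-based inclusive pairs (i, j) with L <= j - i + 1 <= U whose
    sum (or density, if by_density is set) is at least d."""
    n = len(a)
    P = [0] + list(accumulate(a))           # P[i] = a_1 + ... + a_i
    out = []
    for i in range(1, n + 1):
        for j in range(i + L - 1, min(n, i + U - 1) + 1):
            total = P[j] - P[i - 1]
            value = total / (j - i + 1) if by_density else total
            if value >= d:
                out.append((i, j))
    return out
```

For the sequence 1, −2, 3, 4 with L = 2, U = 3 and d = 5, the qualifying segments are (2, 4) with sum 5 and (3, 4) with sum 7.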

For the required sum segments problem, when there is only a lower bound on
the length of the segment and no upper bound on its length, Huang [25] gave a
linear time algorithm for the problem using a dynamic programming technique. We
describe an extension of our algorithm for the maximum sum segment problem 
to solve this problem for the general case, i.e., when L and U are arbitrary, in 
linear time. Previously, there was no known algorithm with non-trivial result for 
this case. 

¹ A talk was given on the algorithm in the 20th Annual Fall Workshop on
Computational Geometry 2010 [2].

Combining with the technique of Brodal and Jorgensen [9] we extend our
algorithms for the maximum sum segment problem and the maximum density
segment problem to solve respectively the required sum segments problem and the
required density segments problem. Their time complexities are O(max{n, m})
and O(max{n log L, m}) respectively, where m is the number of segments output.
Previously, there was no known algorithm with non-trivial result for these problems.

All of our algorithms can be used to solve the corresponding higher
dimensional problems by reducing them to 1-dimensional problems in the way described
in [7, 38]. They can also be extended to solve the problems with non-uniform
width and non-uniform weight. We note that for the k maximum sum segments
problem there is another version of the problem where there is no restriction on
the segment length (i.e., L = 0 and U = n) but the segments are not allowed to
overlap. For this case there are linear time algorithms for 1 dimension [10,31].
In this paper, we shall not pursue this line. In all of our algorithms in this paper
the segments are allowed to overlap.

According to [29, 36] the compositional heterogeneity of a genomic sequence 
is strongly correlated to its CG content regardless of the size of the genome. 
It is also found that gene length [14], gene density [41], patterns of codon
usage [32], distribution of different types of repetitive elements [14, 35], number of
isochores [8], length of isochores [29] and recombination rate within chromosomes [19]
are related to CG content. The algorithms can be used directly to find length
constrained CG-rich regions with the maximum sum and average or with some 
user specified content requirement in a DNA sequence. 

The nucleotide composition of a newly determined DNA sequence is ana- 
lyzed to locate its biologically meaningful segments including finding CG-rich 
regions [15,24], TA and CG-deficient regions [30], CpG islands [24], regions rich 
in periodical three-base pattern [33, 39], post processing sequence alignment [40], 
annotating multiple sequence alignments [36] and computing length constrained 
ungapped local alignment [3]. Our algorithms have potential applications in those
areas.

In Section 2 we briefly describe Kim's [27] algorithm for the maximum density 
segment problem. Our algorithms for the maximum density and sum segments 
of a sequence are presented in Sections 3 and 4 respectively. Section 5 describes 
our algorithms for finding all the segments of a sequence having sum or density 
of at least some user specified value. Concluding remarks are given in Section 6. 

2 Kim's Algorithm for Maximum Density Segment 

We describe Kim's [27] algorithm for the maximum density segment problem
using uniform width. He reduced the problem to a geometric one as follows. Let P[i] =
a_1 + a_2 + ... + a_i be the i-th prefix sum of the given sequence S: a_1, a_2, ..., a_n
(define P[0] = 0). This gives n + 1 points in the plane p_0 = (0, P[0]), p_1 =
(1, P[1]), p_2 = (2, P[2]), ..., p_n = (n, P[n]), sorted by their x-coordinates. The
density of a segment a_i, a_{i+1}, ..., a_j can then be interpreted as the slope of the
line segment through the points (i − 1, P[i − 1]) and (j, P[j]). The problem then
is to find p_i and p_j such that p_i p_j has the largest slope.
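
The equivalence between density and slope can be checked directly. The snippet below is a small illustration, assuming uniform widths and 1-based segment indices; the helper names are ours.

```python
from itertools import accumulate
from typing import List


def prefix_sums(a: List[float]) -> List[float]:
    """P[0] = 0 and P[i] = a_1 + ... + a_i."""
    return [0] + list(accumulate(a))


def density(a: List[float], i: int, j: int) -> float:
    """Density of the segment a_i, ..., a_j (1-based, inclusive)."""
    return sum(a[i - 1:j]) / (j - i + 1)


def slope(P: List[float], i: int, j: int) -> float:
    """Slope of the line through (i - 1, P[i - 1]) and (j, P[j])."""
    return (P[j] - P[i - 1]) / (j - (i - 1))
```

For the sequence 3, −1, 4, 1, −5 the prefix sums are 0, 3, 2, 6, 7, 2, and the segment a_2, ..., a_4 has density 4/3, which equals the slope (7 − 3)/(4 − 1) between p_1 and p_4.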

Without any restriction on the segment length, the maximum density prob- 
lem is solved by computing the largest slope defined by a pair of the above 
points. We can use any of a number of O(n log n) slope selection algorithms for
this problem ([12] or [26] for example). The constraints on the segment length 
add a twist to the problem. 

For a given right endpoint p_j, the set of candidate left endpoints p_i has i in the
index-window I_j = [0, j − L + 1] when L ≤ j < U and in I_j = [j − U + 1, j − L + 1]
when j ≥ U. If we maintain the lower convex hull of the points in this
index-window, then the largest slope is found by drawing a tangent from p_j to a
point p_t on this lower hull. The maximum density segment for a fixed j is then
a_{t+1}, ..., a_j. As j goes from L to n the maximum of all slopes found gives the
desired maximum density segment.
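
The tangent formulation can be illustrated with a deliberately naive sketch that rebuilds the lower hull of the window for every j. It assumes uniform widths and integer 1 ≤ L ≤ U, and it takes the left endpoint p_i with L ≤ j − i ≤ U so that the segment is a_{i+1}, ..., a_j; this indexing is our assumption and may differ by one from the window written above. It costs O(nU) time, which is exactly the recomputation the linear-time algorithms avoid; all names are hypothetical.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

Point = Tuple[int, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points: List[Point]) -> List[Point]:
    """Lower convex hull of points already sorted by x (monotone chain)."""
    hull: List[Point] = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()             # hull[-1] lies on or above the new lower edge
        hull.append(p)
    return hull


def max_density_hull(a: List[float], L: int, U: int) -> Optional[Tuple[float, int, int]]:
    """Return (density, first, last) of a maximum density segment, 1-based."""
    n = len(a)
    P = [0] + list(accumulate(a))
    best = None
    for j in range(L, n + 1):
        window = [(i, P[i]) for i in range(max(0, j - U), j - L + 1)]
        # The largest slope to p_j is attained at a vertex of the lower hull.
        for x, y in lower_hull(window):
            s = (P[j] - y) / (j - x)
            if best is None or s > best[0]:
                best = (s, x + 1, j)
    return best
```

Only hull vertices are tested as tangent points, since a point strictly above the lower hull can never give a larger slope to p_j than the hull vertices around it.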

Based on the above formulation, Kim proposed an algorithm that claimed
to be able to perform all the dynamic updates to the lower convex hull as the
index-window moves from the left to the right in O(n) amortized time. This
claim is marred by the following problem. Figure 1 shows the lower convex hull
(lch, for short) of the points inside the index-window where p_x, p_z and p_y
are the leftmost, bottommost and rightmost points on the lch. Kim maintains
the portion of the lch from p_y to p_z in one array and the portion of the lch from p_x
to p_z in another array.

Now, it is crucial to the correctness of Kim's algorithm that, as the window
I_j slides to the right, the algorithm remains updated about the new value of p_z.
Kim's algorithm correctly updates p_z, except in the case shown in Figure 1(b).

Fig. 1. The lower convex hull of the points in the index-window I_j

In this case, when the window slides to the next position the hull update cannot
be done in O(1) time, as Figure 1(c) shows.

The blind spot in Kim's algorithm is that, as the index-window I_j slides to
the right, the lowest point p_z is updated under the tacit assumption that the
situation described above never arises. As shown above, it can.

Fig. 3. p_z may need to be recomputed

3 Our Algorithm for Maximum Density Segment 

First, we describe our algorithm for the case of uniform width, i.e., w_i = 1
for i = 1, ..., n. The main idea underlying the new algorithm is to consider the
right end point p_j (for j = L, L + 1, ..., n) of a candidate largest slope segment
p_i p_j in batches of a fixed size. For each p_j, instead of computing a single lower
convex hull of the feasible set of left points p_i, we compute two lower convex
hulls, a left one and a right one, that are joined at a common extreme point
p_k, j − U + 1 ≤ k ≤ j − L + 1 (Fig. 4). The right lower hulls are computed
incrementally in a left-to-right (LR) pass for the batched set p_j, and the left
hulls in a right-to-left (RL) pass for the same batched set. Thus the problem
that arises in Kim's algorithm from the dynamic convex hull update as a result
of deletion on the left is avoided. The correctness of this scheme is due to the
following easily-proved lemma.

Lemma 1. For a point p_j, L ≤ j ≤ n, let the candidate left end points p_i,
i ∈ [j − U + 1, j − L + 1] of all feasible segments be divided into two groups:
G_1 = {p_i | i ∈ [j − U + 1, j − U + k]} and G_2 = {p_i | i ∈ [j − U + k, j − L + 1]}, with
1 ≤ k ≤ U − L + 1. Then the maximum slope of a segment p_i p_j with p_i ∈ G_1 ∪ G_2
is the maximum of the two maximum slopes obtained by restricting p_i to be first in
G_1 and then in G_2.
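
Lemma 1 is what allows a whole batch of right end points to share one split point. The sketch below only illustrates the index bookkeeping, under our assumption that a batch consists of the right end indices j = k, ..., k + U − L and that the shared split index is k − L + 1; the function names and the generic `score` callback are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def split_windows(k: int, L: int, U: int) -> Iterator[Tuple[int, List[int], List[int]]]:
    """For each right end index j in the batch, split the index window
    [j - U + 1, j - L + 1] at the shared index c = k - L + 1."""
    c = k - L + 1
    for j in range(k, k + U - L + 1):
        left = list(range(j - U + 1, c + 1))    # shrinks as j grows
        right = list(range(c, j - L + 2))       # grows as j grows
        yield j, left, right


def best_over_groups(score: Callable[[int, int], float], j: int,
                     left: List[int], right: List[int]) -> float:
    """Lemma 1: the optimum over the union is the better of the group optima."""
    return max(max(score(i, j) for i in left),
               max(score(i, j) for i in right))
```

As j increases the right group only gains points on its right, and as j decreases the left group only gains points on its left, so each group can be maintained by insertions alone in one of the two passes.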

The right end points p_j are considered in batches of size U − L + 1, where
j ∈ [k, k + U − L] and k > U. The details of the LR and RL passes are as follows.

3.1 The LR pass 

We incrementally compute the convex hulls CH({p_{k−L+1}, p_{k+1−L+1}, ..., p_{j−L+1}}),
for j = k, ..., k + U − L. Following Kim [27], we maintain 3 parameters to aid
the incremental computation: the maximum slope μ found so far, a tangent line
l to the current hull with slope μ and the point of contact a of l with the current
hull.

Initially, l = p_{k−L+1} p_k, μ = slope(l) and a = p_{k−L+1}. For j = k + t, where
0 ≤ t ≤ U − L, we update the right lower hull, μ, l and a according to the 4
cases below:

Case 1 p_{k−L+1+t} and p_{k+t} are both above l (Fig. 5).
    The (right) hull is updated, a is set to the point of contact of the tangent
    from p_{k+t} to this new hull, while this tangent and its slope are set to be the
    new l and μ respectively.
Case 2 p_{k−L+1+t} is above l and p_{k+t} is below l (Fig. 6).
    The (right) hull is updated. However μ, a and l remain unchanged.
Case 3 p_{k−L+1+t} is below l (Fig. 7).
    The (right) hull is updated. Let l' be a line through p_{k−L+1+t} parallel to l.
    Let p_{k+t} be above l'; reset l = p_{k−L+1+t} p_{k+t}, μ = slope(l) and a = p_{k−L+1+t}.
Case 4 p_{k−L+1+t} is below l and p_{k+t} is below l' (Fig. 8).
    The (right) hull is updated. Set l to l' and a = p_{k−L+1+t}.

As p_{k−L+1+t} and hence p_{k+t} both move right, a never moves to the left. So, the
cost for this pass is linear in the number of p_j's considered.

3.2 The RL pass 

This pass needs more careful handling. We incrementally compute the convex
hulls CH({p_{k−L+1}, p_{k+1−L−1}, ..., p_{j−U+1}}) for j = k + U − L, ..., k.

Fig. 5. Both p_{k−L+1+t} and p_{k+t} are above l

Fig. 6. p_{k−L+1+t} is above while p_{k+t} is below l

Initially, l = p_{k−L+1} p_{k+U−L}, μ = slope(l) and a = p_{k−L+1}. For j = k + U −
L − t, where 0 ≤ t ≤ U − L, we update the left lower hull according to the 4
cases below:

Case 1 p_{k−L+1−t} and p_{k+U−L−t} are both above l (Fig. 9).
    The (left) hull is updated. In this case the update does not go left beyond a
    on the current hull. We traverse the convex hull counterclockwise from a to
    check if a tangent can be drawn to it from p_{k+U−L−t} that has a larger slope
    than μ. If so we reset a, μ and l appropriately.

Case 2 p_{k−L+1−t} is above l and p_{k+U−L−t} is below l (Fig. 10).
    The (left) hull is updated. However μ, a and l remain unchanged.

Case 3 p_{k−L+1−t} is below l and p_{k+U−L−t} is above l (Fig. 11).
    The (left) hull is updated to a vertex beyond a. We traverse the updated
    hull from the newly added point counterclockwise to check if a tangent can
    be drawn to it from p_{k+U−L−t} that has a larger slope than μ. If so we reset
    a, μ and l appropriately. In this case, a can move left to the newly added
    point.

Case 4 p_{k−L+1−t} is below l and p_{k+U−L−t} is below l' (Fig. 12).
    The (left) hull is updated as in Case 3. We check if the join of p_{k−L+1−t} and
    p_{k+U−L−t} has a larger slope than μ (this is easily done by discriminating
    with respect to a line through p_{k−L+1−t} parallel to l). In that case we reset
    a to p_{k−L+1−t}, l to p_{k−L+1−t} p_{k+U−L−t} and μ to the slope of the new l.

As both p_{k+U−L−t} and p_{k−L+1−t} move left, a may move backward to the left
at most once for each p_j. So, the cost for this pass is linear in the number of p_j's
considered.

Fig. 7. p_{k−L+1+t} is below l and p_{k+t} is above l'
Fig. 8. p_{k−L+1+t} is below l and p_{k+t} is below l'
Fig. 9. p_{k−L+1−t} is above l and p_{k+U−L−t} is above l
Fig. 10. p_{k−L+1−t} is above l and p_{k+U−L−t} is below l
Fig. 11. p_{k−L+1−t} is below l and p_{k+U−L−t} is above l
Fig. 12. p_{k−L+1−t} is below l and p_{k+U−L−t} is also below l

3.3 Analysis 

Initially, we find the maximum density for the segments with right end points
p_j, j = L, L + 1, ..., U using the incremental lower convex hull in a right pass
as described above. Then for each batch of right end points p_j, j = U + b(U −
L), U + b(U − L) + 1, ..., U + (b + 1)(U − L), where b is an integer with values
b = 1, ..., ⌊(n − U)/(U − L)⌋, we find the maximum density for all the feasible
segments having these points as right end points in two passes as described
above. For the residual set of the p_j's we use an LR pass only. The indices of
the subsequence with the maximum density (slope) over all these passes
are returned as the result. It is clear from the way the computation has been
organized that the time complexity is in O(n). Thus, we have the following
theorem:

Theorem 1. Given a sequence A of n real numbers and two real numbers L 
and U with 1 ≤ L ≤ U ≤ n, our geometric algorithm as described above finds
the maximum density segment of A from among all the segments of A of length 
at least L and at most U in linear time in an online manner. 

The above algorithm for uniform widths can be extended to solve the
general problem where the widths, U and L are arbitrary. For this we define the
cumulative width W_i, i = 0, ..., n, as W_0 = 0 and W_i = w_1 + ... + w_i. Then the
density μ_{ij} of a segment S(i, j) can be written as

    μ_{ij} = (P_j − P_{i−1}) / (W_j − W_{i−1}).

For each element a_i, i = 1, ..., n, in the sequence S we get the point (W[i], P[i])
in the plane. We have the initial point (W[0], P[0]) = (0, 0). Then
the problem of finding the feasible segment with the maximum density is reduced
to finding the feasible pair of points with the maximum slope. This can
be solved by a simple modification to our above algorithm for uniform widths.
The only difference is that the abscissas of the consecutive points (W[i], P[i]) and
(W[i + 1], P[i + 1]) are w_{i+1} apart instead of being equally spaced. Its time
and space complexity will remain O(n) and O(U) as before. Thus, we have the
following theorem:

Theorem 2. Given a sequence A of n pairs of real numbers (a_i, w_i), i = 1, ..., n,
and two real numbers L and U with 1 ≤ L ≤ U ≤ n, our geometric algorithm
as described above finds the maximum density segment of A from among all the
segments of A of length at least L and at most U in linear time in an online
manner.
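
For non-uniform widths the feasibility test and the density both use the cumulative widths. The following brute-force reference is a sketch only, assuming strictly positive widths; it takes quadratic time in the worst case and is not the linear-time geometric algorithm of Theorem 2. The function name is hypothetical.

```python
from itertools import accumulate
from typing import List, Optional, Tuple


def max_density_weighted(a: List[float], w: List[float],
                         L: float, U: float) -> Optional[Tuple[float, int, int]]:
    """Return (density, i, j), 1-based inclusive, maximizing
    (P[j] - P[i-1]) / (W[j] - W[i-1]) subject to L <= W[j] - W[i-1] <= U."""
    n = len(a)
    P = [0] + list(accumulate(a))   # cumulative sums
    W = [0] + list(accumulate(w))   # cumulative widths, W[0] = 0
    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            width = W[j] - W[i - 1]
            if width > U:
                break               # widths are positive, so width only grows with j
            if width >= L:
                s = (P[j] - P[i - 1]) / width
                if best is None or s > best[0]:
                    best = (s, i, j)
    return best
```

With values 4, 1, 6, widths 2, 1, 3, L = 3 and U = 4, the feasible segments are (1, 2) with density 5/3, (2, 3) with density 7/4 and (3, 3) with density 2, so the last one is returned.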

Using the method of [7, 38] the 2-dimensional problem is reduced to (m choose 2) + m
1-dimensional problems, where the input is an m x n matrix. We solve
each of them using the above algorithm. The time complexity will be O(m^2 n).

Theorem 3. Given a 2-dimensional m x n matrix A of pairs of real numbers
(a_{ij}, w_{ij}), i = 1, ..., m; j = 1, ..., n, and two real numbers L and U with 1 ≤ L ≤
U ≤ n, there exists an algorithm to find the maximum density subarray of A
from among all the subarrays of A of length at least L and at most U in O(m^2 n)
time and O(mU) space.

Proof. Omitted. 



The above algorithm can be extended to any dimension d in a straightforward
way.

Theorem 4. Given a d-dimensional n_1 x n_2 x ... x n_d matrix A of pairs of
real numbers and two real numbers L and U with 1 ≤ L ≤ U ≤ n, there exists
an algorithm to find the maximum density subarray of A from among all the
subarrays of A of length at least L and at most U in O(n_1 ∏_{i=2}^{d} n_i^2) time and
O(U ∏_{i=2}^{d} n_i) space.

Proof. Omitted. 

All of our remaining algorithms in this paper can be extended in a similar way
to solve the corresponding higher dimensional problems by using the technique
of reducing the problems to 1 dimension [7, 38].

3.4 k Maximum Density Segments 

In this paper, we shall use the term heap to denote a maximum heap ordered
binary tree. First, we consider the case k ≤ U − L. The above SPLITHULL
algorithm is repeated k times for each batch of U − L + 1 points to find at least k
maximum density segments with right end points in the batch. In each iteration
the maximum density segment for the iteration is found. Keeping the left end
point of this segment fixed, all the valid segments with right end point within
the current batch of U − L + 1 points are selected. Then the left end point is
deleted at the start of the next iteration. All these maximum density segments
are inserted into a heap H [34]. The k largest density segments are selected from
H in O(k) time using the binary heap selection algorithm of Frederickson [16].
The left pass of the algorithm for a batch of U − L + 1 points is described below.



Algorithm MDS-SMALLK

1. Let X and Y be the sets of U − L + 1 left and right end
   points of algorithm SPLITHULL.
2. For i = 1 to k do:
   2.1. Find the maximum density segment from among the valid segments
        with left end points in X and right end points in Y using the left pass of the
        SPLITHULL algorithm.
   2.2. Let x and y be the left and right end points of this segment
        respectively. Insert (x, y, d(x, y)) in the skew heap H.
   2.3. For all valid segments with left end point x, i.e., (x, y'), y' ∈ Y, insert
        (x, y', d(x, y')) in the skew heap H.
   2.4. Delete x from X.



In each iteration the algorithm will select at most U − L + 1 segments. In
k iterations it will select at most k(U − L + 1) segments. The right pass of
the algorithm for the same batch of points is similar. It uses the right pass of
SPLITHULL. Thus the maximum number of segments selected for the batch of
U − L + 1 right end points is 2k(U − L + 1). The cost of each insertion in H is in
O(1). For the batch of U − L + 1 points, the total cost of selection of maximum
density segments and inserting them in H is O(k(U − L + 1)). Thus the total
cost per right end point is in O(k).

Theorem 5. Given a sequence A of n real numbers, two integers L and U with
1 ≤ L ≤ U ≤ n, and one integer k ≤ U − L, the MDS-SMALLK algorithm finds the
k maximum density segments of A from among all the segments of A of length
at least L and at most U in O(kn) time in an online manner.

When k > U − L we do not use the SPLITHULL algorithm. Instead, for
each of the ⌈n/(U − L + 1)⌉ batches of U − L + 1 consecutive elements of the
sequence we insert into H all the possible segments with right end elements in
that batch. We then select the 2k maximum segments from H, using Frederickson's heap
selection algorithm, and reconstruct H from these segments. The total cost is
clearly O(U − L) for each right end point. At the end of processing of the last
right end point we have found 2k maximum density segments. From these, we
find the k-th maximum in O(k) time. Thus, we have the following theorem:

Theorem 6. Given a sequence A of n real numbers, two integers L and U with
1 ≤ L ≤ U ≤ n, and one integer k > U − L, the above algorithm finds the k
maximum density segments of A from among all the segments of A of length at
least L and at most U in O(n(U − L)) time in an online manner.
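
The batch-and-prune strategy for k > U − L can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses Python's `heapq.nlargest` in place of the skew heap and Frederickson's linear-time heap selection, so pruning costs an extra logarithmic factor; widths are uniform and the function name is hypothetical.

```python
import heapq
from itertools import accumulate
from typing import List, Tuple


def k_max_density_segments(a: List[float], L: int, U: int,
                           k: int) -> List[Tuple[float, int, int]]:
    """Return the k largest (density, i, j) triples, 1-based inclusive,
    over segments with L <= j - i + 1 <= U, in decreasing order."""
    n = len(a)
    P = [0] + list(accumulate(a))
    batch = U - L + 1
    kept: List[Tuple[float, int, int]] = []
    for start in range(L, n + 1, batch):
        # Insert every feasible segment whose right end lies in this batch.
        for j in range(start, min(n, start + batch - 1) + 1):
            for i in range(max(1, j - U + 1), j - L + 2):
                kept.append(((P[j] - P[i - 1]) / (j - i + 1), i, j))
        kept = heapq.nlargest(2 * k, kept)   # prune so storage stays bounded
    return heapq.nlargest(k, kept)
```

Pruning to 2k candidates after each batch keeps the working set at O(k + (U − L + 1)^2) triples, while the final selection returns the k best seen over all batches.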

4 Maximum Sum Segment 

We shall use a simple modification of the method of Brodal and Jorgensen [9] to solve
the maximum sum problem. As before we solve the problem in batch mode with
U − L elements in each batch as right end elements of the feasible
segments being considered in a batch. Analogous to Lemma 1 for the maximum
density segment problem we have the following trivial lemma:

Lemma 2. For an element a_j, L ≤ j ≤ n, let the candidate left end elements
a_i, i ∈ [j − U + 1, j − L + 1] of all feasible segments be divided into two groups:
G_1 = {a_i | i ∈ [j − U + 1, j − U + k]} and G_2 = {a_i | i ∈ [j − U + k, j − L + 1]},
with 1 ≤ k ≤ U − L + 1. Then the maximum sum of all the feasible segments
S(i, j) with a_i ∈ G_1 ∪ G_2 is the maximum of the two maximum sums obtained
by restricting a_i to be first in G_1 and then in G_2.
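
The way Lemma 2 supports a two-pass batch scheme can be illustrated with a simpler stand-in for the persistent heap: block-wise prefix and suffix minima of the prefix sums. This is a swapped-in technique chosen for brevity, not the construction of Brodal and Jorgensen; it assumes uniform widths and integers 1 ≤ L ≤ U ≤ n, stores whole arrays for clarity (so it uses O(n) rather than O(U) space), and all names are hypothetical.

```python
from itertools import accumulate
from typing import List


def max_sum_segment(a: List[float], L: int, U: int) -> float:
    """Maximum of P[j] - P[i] over L <= j - i <= U, i.e. over segments
    a_{i+1}, ..., a_j, in O(n) time."""
    n = len(a)
    assert 1 <= L <= U <= n
    P = [0] + list(accumulate(a))
    w = U - L + 1                  # number of candidate left ends per right end
    m = n - L + 1                  # candidates are P[0], ..., P[n - L]
    left = P[:m]                   # left[i]: min of P from its block start to i
    right = P[:m]                  # right[i]: min of P from i to its block end
    for i in range(1, m):          # left-to-right pass inside each block
        if i % w:
            left[i] = min(left[i], left[i - 1])
    for i in range(m - 2, -1, -1): # right-to-left pass inside each block
        if (i + 1) % w:
            right[i] = min(right[i], right[i + 1])
    best = None
    for j in range(L, n + 1):
        s, e = j - U, j - L        # candidate window is [max(0, s), e]
        # Lemma 2: the window optimum is the better of the two group optima.
        lo = left[e] if s <= 0 else min(right[s], left[e])
        if best is None or P[j] - lo > best:
            best = P[j] - lo
    return best
```

Each window of w candidates spans at most two blocks, so the suffix minimum of the first block and the prefix minimum of the second play the roles of the two groups G_1 and G_2.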

So, for each batch of U − L right end points we make 2 passes as before. For
a pass we shall use the algorithm of Brodal and Jorgensen [9] to construct a partially
persistent [13] max-heap that implicitly contains all the feasible segments with
their respective sums. It will take O(U) time and space to build it. From this heap
we select the maximum element in constant time. For each pass of each batch of
U − L right end elements we update the maximum. To build the heap we note
that the sets of all feasible segments have right end points a_j, where a_j is one of
the right end elements in the current batch of right end elements. Let the batch
of elements be a_j, ..., a_{j+U−L}, where j = k(U − L), k = 1, > U. Here
we describe the RL pass. The LR pass is similar and simpler. The incremental
construction of the heap elements (δ_{k+1}, δ'_{k+1}, H_suf^{k+1}) is shown below:

    (δ_{j+1}, δ'_{j+1}, H_suf^{j+1}) = (S(j − U + L + 1, j − L + 1), S(j − U + L + 1), ∅),

    (δ_{m+1}, δ'_{m+1}, H_suf^{m+1}) = (δ_m + a_{m+1}, δ'_m − a_{j−m}, H_suf^{m} ∪ {δ'_m − δ_m}),

where m = j, j − 1, ..., j − U + L − 1. Thus, we have the following theorem:

Theorem 7. Given a sequence A of n real numbers and two real numbers L and 
U with 1 ≤ L ≤ U ≤ n, our algorithm as described above finds the maximum 
sum segment of A from among all the segments of A of length at least L and at 
most U in linear time in an online manner. 

Proof. Omitted. 
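
For comparison, the problem statement of Theorem 7 can be illustrated with a different and simpler linear-time technique: prefix sums combined with a sliding-window minimum kept in a monotonic deque. This sketch is not the partially persistent heap construction of [9]; the function name and the 1-based output convention are assumptions made for the example, and L and U are taken to be integers.

```python
from collections import deque
from itertools import accumulate

def max_sum_segment(a, L, U):
    """Return (best_sum, i, j), 1-based inclusive, with L <= j - i + 1 <= U."""
    n = len(a)
    prefix = [0] + list(accumulate(a))
    window = deque()   # indices p = i - 1, kept with increasing prefix[p]
    best = None
    for j in range(L, n + 1):
        p = j - L                      # newest admissible value of i - 1
        while window and prefix[window[-1]] >= prefix[p]:
            window.pop()               # dominated: larger prefix and older
        window.append(p)
        while window[0] < j - U:       # segment would be longer than U
            window.popleft()
        cand = prefix[j] - prefix[window[0]]
        if best is None or cand > best[0]:
            best = (cand, window[0] + 1, j)
    return best
```

Each index enters and leaves the deque at most once, so the total work is linear in n, and the deque never holds more than U − L + 1 indices.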

By making a simple modification, as was done for our algorithm for the 
maximum density segment problem in the non-uniform case, the algorithm 
described above can solve the problem for the non-uniform case in the 
same time and space complexity. 

4.1 k Maximum Sum Segments 

To find the k maximum sum segments we extend the above algorithm following 
Brodal and Jørgensen [9]. As before, we work in batch mode, each batch 
consisting of the feasible segments with U − L + 1 right end elements. For all 
the feasible segments in the batch we construct the partially persistent 
max-heap as before. We select the largest k elements from this heap in O(k) 
time. We insert each of them into the self-adjusting (skew) heap H^{l−1} of 
Sleator and Tarjan [34] in amortized constant time. We select the largest k 
elements from this heap using Frederickson's [16] heap selection algorithm and 
insert them into a new skew heap H^l. We delete the old skew heap H^{l−1}. 
Thus, we have the following theorem: 

Theorem 8. Given a sequence A of n real numbers, two integers L and U with 
1 ≤ L ≤ U ≤ n, and one integer k ≤ U − L, there exists an algorithm to find 
the k maximum sum segments of A from among all the segments of A of length 
at least L and at most U in O(n) time and O(U) space. 
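
The batch structure of this procedure can be sketched as follows. This is an illustration under simplifying assumptions, not the algorithm of Theorem 8: the feasible segments of a batch are enumerated explicitly instead of being held in a partially persistent heap, and `heapq.nlargest` stands in for Frederickson's heap selection, so the sketch does not achieve the stated O(n) bound. The function name is hypothetical.

```python
import heapq
from itertools import accumulate

def k_max_sum_segments(a, L, U, k):
    """Return the k largest (sum, i, j) triples, 1-based, with L <= j-i+1 <= U."""
    n = len(a)
    prefix = [0] + list(accumulate(a))
    batch = U - L + 1                          # right end points per batch
    top = []                                   # the k best segments seen so far
    for start in range(L, n + 1, batch):
        cands = [
            (prefix[j] - prefix[i - 1], i, j)
            for j in range(start, min(start + batch, n + 1))
            for i in range(max(1, j - U + 1), j - L + 2)
        ]
        # Merge the batch with the running answer; the old set is discarded.
        top = heapq.nlargest(k, top + cands)
    return top
```

Only the current batch and the running k best segments are kept at any time, which mirrors the replacement of the old skew heap by the new one after each batch.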

When k > U − L we use the above algorithm for each of the ⌈n/(U − L + 1)⌉ 
batches of U − L + 1 consecutive elements of the sequence: we insert into H all 
the possible segments with right end points in that batch, select the 2k 
maximum segments from the heap using Frederickson's heap selection algorithm, 
and then reconstruct H from these segments. The total cost is clearly O(U − L) 
for each right end point. At the end of processing of the last right end point 
we have found 2k maximum sum segments. From them we find the k-th maximum in 
O(k) time. Thus, we have the following theorem: 
Theorem 9. Given a sequence A of n real numbers, two real numbers L and U 
with 1 ≤ L ≤ U ≤ n, and one integer k > U − L, there exists an algorithm to 
find the k maximum sum segments of A from among all the segments of A of 
length at least L and at most U in O(n + k) time and O(k) space. 
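
The final step, finding the k-th maximum among the 2k retained candidates and keeping only the k largest, can be sketched with randomized quickselect. This is an assumption made for illustration: quickselect runs in expected rather than worst-case linear time, whereas the O(k) bound in the text presumes a deterministic linear-time selection. Both function names are hypothetical, and candidates are represented by their sums only.

```python
import random

def kth_largest(values, k):
    """Return the k-th largest element (k is 1-based) by randomized quickselect."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        greater = [v for v in items if v > pivot]
        equal = [v for v in items if v == pivot]
        if k <= len(greater):
            items = greater                      # answer is among the larger ones
        elif k <= len(greater) + len(equal):
            return pivot                         # pivot itself is the k-th largest
        else:
            k -= len(greater) + len(equal)       # skip everything >= pivot
            items = [v for v in items if v < pivot]

def prune_to_k(candidates, k):
    """Keep the k largest candidate sums out of (typically) 2k of them."""
    if len(candidates) <= k:
        return list(candidates)
    cut = kth_largest(candidates, k)
    above = [v for v in candidates if v > cut]
    return above + [cut] * (k - len(above))      # fill up with ties at the cut
```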

5 Finding All the Segments with Some Content Requirement 

In genomic sequence analysis sometimes it is necessary to find all the segments of 
a sequence with some user specified minimum sum or density requirements [2 5]. 
In this section we first describe the case of sum and then the case of density. 

5.1 Finding All the Segments with a Minimum Sum 

We construct the partially persistent self adjusting binary heap as before. From 
this heap we select all the segments having sum of at least the user specified 
amount. 
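
The selection step relies on a property of heap order: once a node falls below the threshold, its whole subtree can be skipped, so the work is proportional to the number of reported elements. A minimal sketch on an ordinary array-based binary max-heap follows; it is not the partially persistent heap used above, and the function names are hypothetical.

```python
import heapq

def build_max_heap(values):
    """Build an array-based max-heap by heapifying the negated values."""
    neg = [-v for v in values]
    heapq.heapify(neg)                 # min-heap on negated values
    return [-v for v in neg]           # same layout, now in max-heap order

def all_at_least(max_heap, threshold):
    """Report every value >= threshold, visiting only qualifying nodes
    and their immediate children."""
    out, stack = [], [0]
    while stack:
        idx = stack.pop()
        if idx < len(max_heap) and max_heap[idx] >= threshold:
            out.append(max_heap[idx])
            stack.extend((2 * idx + 1, 2 * idx + 2))   # descend to children
    return out
```

If m elements qualify, at most 2m + 1 nodes are inspected, regardless of the size of the heap.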

5.2 Finding All the Segments with a Minimum Density 

As before, our algorithm works in batch mode. For each batch of U − L + 1 
right end points we find all the segments with left end points such 
that the density of each of them is at least the user specified minimum density. 
This is done in two passes as before. For each pass, the set of left end points 
of the SPLITHULL algorithm is sorted incrementally, with the new left end point 
corresponding to the new right end point being put into its sorted position. The 
relative positions of a pair of points are determined by the distance between 
the pair of parallel straight lines passing through them and having slope equal 
to the user specified minimum density. For the new right end point the 
corresponding set of sorted left end points is searched to find a segment with 
density at least the given amount. Then all the segments with higher densities 
are selected. The left pass of the algorithm for a batch of U − L + 1 points is 
described below. In the algorithm, X and Y are respectively the sets of 
U − L + 1 left and right end points of algorithm SPLITHULL, and d is the user 
specified density requirement. 



Algorithm ALL-HEAVIER-SEGMENTS(X, Y, d) 

1. Let X' = ∅ and S = ∅. 

2. For i = 1 to U − L + 1 do: 

2.1. Let x_i ∈ X and y_i ∈ Y be the new left and right end points 
respectively. Insert x_i into X' in its sorted position i' according to the 
measure of relative position and the method of comparison described above. Let 
X' be sorted in increasing order of the x-intercept of the line passing through 
the left end point and having slope equal to the user specified minimum density. 

2.2. Search X' using binary search to find the element x_{i'} ∈ X' for which 
the segment (x_{i'}, y_i) has density at least d. 

2.3. From X' select all the elements x'_j in the part from x_{i'} to the end of 
X'. Insert (x'_j, y_i) in S for all such x'_j. 

3. Return S. 
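
The idea behind the algorithm can be sketched with prefix sums. Writing P for the prefix sums, the segment a_i, ..., a_j has density at least d exactly when P[j] − d·j ≥ P[i−1] − d·(i−1), so each point can be ranked by the intercept of the line of slope d through it. The sketch below is a simplification under stated assumptions: it keeps one sliding sorted list with deletions instead of the two deletion-free passes per batch, it ranks by the y-intercept P[p] − d·p, and list insertion and removal cost time linear in the window size, so it does not match the stated bound. The function name is hypothetical.

```python
from bisect import bisect_right, insort
from itertools import accumulate

def all_heavier_segments(a, L, U, d):
    """Return all (i, j), 1-based, with L <= j-i+1 <= U and density >= d."""
    n = len(a)
    prefix = [0] + list(accumulate(a))
    key = [prefix[p] - d * p for p in range(n + 1)]  # intercept at slope d
    window = []                                      # sorted (key, p) pairs
    out = []
    for j in range(L, n + 1):
        insort(window, (key[j - L], j - L))          # new left point enters
        if j - U - 1 >= 0:                           # oldest left point leaves
            window.remove((key[j - U - 1], j - U - 1))
        # Every left point with key <= key[j] yields density >= d.
        cut = bisect_right(window, (key[j], n + 1))
        out.extend((p + 1, j) for _, p in window[:cut])
    return out
```

With integer data and an integer or `fractions.Fraction` threshold d the comparisons are exact; with floating point thresholds, segments whose density equals d may be affected by rounding.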

The right pass of the algorithm for the same batch of points is similar. Sorting 
the set of left end elements takes O((U − L) log(U − L)) time. For each right 
end point the binary search to find the left end point in the set X' takes at 
most O(log(U − L)) time. Thus, the total time is O(max(n log(U − L), m)), where 
m is the number of output segments. 

Theorem 10. Given a sequence A of n real numbers, two integers L and U 
with 1 ≤ L ≤ U ≤ n, and one real number d, the ALL-HEAVIER-SEGMENTS 
algorithm finds all the segments of A of length at least L and at most U having 
density at least d in O(max(n log(U − L), m)) time in an online manner, where m 
is the number of output segments. 

By making the same modification as was done for our algorithm for the 
maximum density segment problem in the non-uniform case, the 
ALL-HEAVIER-SEGMENTS algorithm can solve the problem for the non-uniform case 
in the same time and space complexity. 

6 Conclusions 

In this paper two problems concerning the search for the interesting regions in a 
sequence are considered. The problems are to find a consecutive subsequence of 
length at least L and at most U with the maximum sum and density respectively. 
We have presented linear time algorithms for both problems. We have extended 
our algorithms to find the k segments of length at least L and at most U 
with the largest sum and density. We have also extended our algorithms to find 
all the segments with user specified sum or density. We indicate the extensions of 
our algorithms to higher dimensions. Our algorithms facilitate efficient solutions 
for all these problems in higher dimensions. All the algorithms can be extended 
in a straightforward way to solve the problems with non-uniform width and 
non-uniform weight. 

The algorithms have applications in several areas of biomolecular sequence 
analysis including finding CG-rich regions, TA and CG-deficient regions, regions 
rich in periodical three-base pattern, post processing sequence alignment, anno- 
tating multiple sequence alignments and computing length constrained ungapped 
local alignment. 

It would be interesting to study whether there is a linear time algorithm to 
find the k-th maximum density segment with length between the lower and upper 
bounds L and U respectively. It is also interesting to investigate whether 
there is an algorithm, linear in the size of the input and output, to find all 
the segments with length between L and U that satisfy a user specified minimum 
density requirement. Finding more efficient algorithms for the problems in 
higher dimensions also merits investigation. It remains open to improve the 
trivial lower bounds for these cases. 